<html>

<head>
<title>Overrepresented Sequences</title>
<style type="text/css">
	body {
		font-family: sans-serif;
	}
</style>
</head>
<body>
<h1>Overrepresented Sequences</h1>
<h2>Summary</h2>
<p>
A normal high-throughput library will contain a diverse set
of sequences, with no individual sequence making up more than a tiny
fraction of the whole.  Finding that a single sequence is heavily
overrepresented in the set means either that it is highly
biologically significant, or that the library is
contaminated or not as diverse as you expected.
</p>
<p>
This module lists all of the sequences which make up more than
0.1% of the total.  To conserve memory, only sequences which
appear in the first 200,000 sequences are tracked
to the end of the file.  It is therefore possible that a sequence
which is overrepresented but, for some reason, doesn't appear at the
start of the file could be missed by this module.
</p>
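<p>
The tracking rule above can be illustrated with a minimal Python sketch.
It is not the program's own code: the function name
<code>find_overrepresented</code> and its parameters are hypothetical, and it
assumes that "the first 200,000 sequences" means the first 200,000 reads in
the file.
</p>
<pre>
```python
from collections import Counter

OBSERVATION_LIMIT = 200_000   # reads during which new sequences start being tracked
REPORT_THRESHOLD_PCT = 0.1    # a sequence is listed above this share of the total


def find_overrepresented(reads, observation_limit=OBSERVATION_LIMIT,
                         threshold_pct=REPORT_THRESHOLD_PCT):
    """Return (sequence, count, percentage) tuples, most frequent first."""
    counts = Counter()
    total = 0
    for seq in reads:
        total += 1
        if seq in counts:
            # Already tracked: keep counting to the end of the file.
            counts[seq] += 1
        elif observation_limit >= total:
            # Still inside the observation window: start tracking.
            counts[seq] = 1
        # Otherwise the sequence first appeared too late and is ignored.

    hits = [(seq, n, 100.0 * n / total)
            for seq, n in counts.items()
            if 100.0 * n / total > threshold_pct]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```
</pre>
<p>
With a small window, for example
<code>find_overrepresented(reads, observation_limit=6, threshold_pct=10)</code>,
a sequence that first appears as the seventh read is never counted, however
often it occurs afterwards.  This is the blind spot described above.
</p>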
<p>
For each overrepresented sequence the program will look for matches
in a database of common contaminants and will report the best hit
it finds.  Hits must be at least 20bp in length and have no more
than 1 mismatch.  Finding a hit doesn't necessarily mean that this
is the source of the contamination, but may point you in the right
direction.  Note also that many adapter sequences
are very similar to each other, so you may get a hit reported which
isn't technically correct, but which has a very similar sequence to
the actual match.
</p>
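<p>
As a rough illustration of the hit criteria (at least 20bp, no more than
1 mismatch), the sketch below slides a query along a contaminant without gaps
and keeps the longest stretch that stays within the mismatch limit.  The
functions <code>longest_hit</code> and <code>best_contaminant</code> and the
contaminant name used in the example are hypothetical, and the ungapped
sliding approach is an assumption rather than a description of the program's
actual search.
</p>
<pre>
```python
MIN_HIT_LENGTH = 20   # shortest alignment accepted as a hit
MAX_MISMATCHES = 1    # mismatches tolerated within a hit


def longest_hit(query, contaminant, max_mismatches=MAX_MISMATCHES):
    """Length of the longest ungapped alignment within the mismatch limit."""
    best = 0
    # offset is the position of the query start relative to the contaminant.
    for offset in range(-(len(query) - 1), len(contaminant)):
        q_start = max(0, -offset)
        c_start = max(0, offset)
        overlap = min(len(query) - q_start, len(contaminant) - c_start)
        window_start = 0
        mismatches = []   # mismatch positions inside the current window
        for i in range(overlap):
            if query[q_start + i] != contaminant[c_start + i]:
                mismatches.append(i)
                if len(mismatches) > max_mismatches:
                    # Too many: restart the window after the oldest mismatch.
                    window_start = mismatches.pop(0) + 1
            best = max(best, i - window_start + 1)
    return best


def best_contaminant(query, contaminants, min_length=MIN_HIT_LENGTH):
    """Return (name, hit_length) for the best qualifying hit, or None."""
    best_name, best_length = None, 0
    for name, sequence in contaminants.items():
        length = longest_hit(query, sequence)
        if length >= min_length and length > best_length:
            best_name, best_length = name, length
    return (best_name, best_length) if best_name is not None else None
```
</pre>
<p>
Because only the single best hit is reported, two contaminants that share
most of their sequence can both qualify, and the one returned may not be the
true source.
</p>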
<p>
Because the duplication detection requires an exact sequence match over
the whole length of the sequence, any reads over 75bp in length are truncated
to 50bp for the purposes of this analysis.  Even so, longer reads are more
likely to contain sequencing errors, which will artificially increase the
observed diversity and will tend to underrepresent highly duplicated sequences.
</p>
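<p>
The truncation rule can be sketched as a small helper.  The name
<code>truncate_for_counting</code> is hypothetical, and the sketch assumes that
the first 50 bases of the read are the ones kept.
</p>
<pre>
```python
LONG_READ_THRESHOLD = 75   # reads longer than this are truncated
TRUNCATED_LENGTH = 50      # number of bases kept from a long read


def truncate_for_counting(seq):
    """Return the part of a read used as the key for exact matching."""
    if len(seq) > LONG_READ_THRESHOLD:
        # Assumption: the start of the read is kept.
        return seq[:TRUNCATED_LENGTH]
    return seq
```
</pre>
<p>
Two 100bp reads which differ only by an error at position 80 produce the same
50bp key and are counted together, whereas two 75bp reads with the same
difference at position 70 are left at full length and counted separately.
</p>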


<h2>Warning</h2>
<p>
This module will issue a warning if any sequence is found to represent
more than 0.1% of the total.
</p>

<h2>Failure</h2>
<p>
This module will issue an error if any sequence is found to represent
more than 1% of the total.
</p>
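<p>
Taken together, the warning and failure thresholds amount to a simple
classification of the most abundant sequence.  The sketch below is
illustrative only: the function name <code>classify_module</code> and the
result labels are hypothetical.
</p>
<pre>
```python
WARN_THRESHOLD_PCT = 0.1   # more than this share of the total gives a warning
FAIL_THRESHOLD_PCT = 1.0   # more than this share of the total gives a failure


def classify_module(percentages):
    """Classify a run from the percentage of the total taken by each sequence."""
    highest = max(percentages, default=0.0)
    if highest > FAIL_THRESHOLD_PCT:
        return "fail"
    if highest > WARN_THRESHOLD_PCT:
        return "warn"
    return "pass"
```
</pre>
<p>
Both comparisons are strict, so a sequence at exactly 0.1% does not trigger a
warning and one at exactly 1% triggers a warning rather than a failure.
</p>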

</body>
</html>
